From Scans to Searchable Text: Top Open-Source OCR Models Explained
An overview of leading open-source OCR models, with guidance on selecting the right option for printed, handwritten, or multimodal documents.
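To make the selection concrete, here is a minimal sketch of the "scans to searchable text" pipeline under two common open-source choices: a classical engine such as Tesseract for printed scans, and a transformer model such as TrOCR for handwriting. It assumes `pytesseract`, Hugging Face `transformers`, and a local Tesseract install; the file names are hypothetical placeholders.

```python
# Minimal sketch: routing two document types to different open-source OCR engines.
from PIL import Image
import pytesseract
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Printed documents: Tesseract is a fast, CPU-friendly default.
printed = Image.open("scanned_page.png")          # hypothetical scan
print(pytesseract.image_to_string(printed))       # default English model

# Handwritten text: TrOCR, an encoder-decoder vision-language model,
# typically handles cursive and irregular strokes better. It operates on
# single text lines, so crop line images before decoding.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

line = Image.open("handwritten_line.png").convert("RGB")  # hypothetical crop
pixel_values = processor(images=line, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The split illustrates the general trade-off: classical engines are lightweight and strong on clean printed text, while transformer-based models cost more compute but cope better with handwriting and irregular layouts.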
More articles on multimodal and vision-language AI:

- The Alibaba Qwen team introduced GUI-Owl and Mobile-Agent-v3, a unified multimodal agent and multi-agent framework that automates GUI tasks across mobile and desktop with state-of-the-art benchmark performance.
- Liquid AI unveiled LFM2-VL, two open-weight vision-language models optimized for fast, low-latency on-device inference, available in 450M and 1.6B parameter variants with easy integration via Hugging Face.
- Mirage introduces a method for vision-language models to perform visual reasoning without generating images, significantly improving their ability to solve spatial and multimodal tasks.
- Google has open-sourced MedGemma 27B Multimodal and MedSigLIP, models designed for scalable multimodal medical reasoning and efficient healthcare AI applications.
- Poor product data in fashion leads to lost sales, increased returns, and customer frustration; multimodal AI offers a scalable way to improve data accuracy and streamline retail operations.
- X-Fusion introduces a dual-tower architecture that adds vision capabilities to frozen large language models, preserving their language skills while improving multimodal performance in image understanding and generation.
- Enkrypt AI's report reveals serious safety flaws in Mistral's vision-language models that enable the generation of harmful content, urging continuous security improvements in multimodal AI systems.
- A hands-on tutorial covers implementations of four key vision foundation models (CLIP, DINOv2, SAM, and BLIP-2), highlighting business applications from product classification to marketing content analysis.
- UniME introduces a two-stage framework that improves multimodal representation learning through textual knowledge distillation and hard-negative instruction tuning, outperforming existing models on multiple benchmarks.
- A recent study shows how annotation errors in AI datasets distort the evaluation of vision-language models, and advocates better human labeling practices to improve model reliability and reduce hallucinations.
- NVIDIA introduced Describe Anything 3B, a multimodal large language model that excels at detailed, region-specific captioning of images and videos, outperforming existing models on multiple benchmarks.